Unsupervised Lexical Acquisition for Part of Speech Tagging
نویسندگان
چکیده
It is known that POS tagging is not very accurate for unknown words (words which the POS tagger has not seen in the training corpora). Thus, a first step to improve the tagging accuracy would be to extend the coverage of the tagger’s learned lexicon. It turns out that, through the use of a simple procedure, one can extend this lexicon without using additional, hard to obtain, hand-validated training corpora. The basic idea consists of merely adding new words along with their (correct) POS tags to the lexicon and trying to estimate the lexical distribution of these words according to similar ambiguity classes already present in the lexicon. We present a method of automatically acquire high quality POS tagging lexicons based on morphologic analysis and generation. Currently, this procedure works on Romanian for which we have a required paradigmatic generation procedure but the architecture remains general in the sense that given the appropriate substitutes for the morphological generator and POS tagger, one should obtain similar results.
منابع مشابه
Part-of-Speech Tagging in Context
We present a new HMM tagger that exploits context on both sides of a word to be tagged, and evaluate it in both the unsupervised and supervised case. Along the way, we present the first comprehensive comparison of unsupervised methods for part-of-speech tagging, noting that published results to date have not been comparable across corpora or lexicons. Observing that the quality of the lexicon g...
متن کاملThe Role of Private Speech Produced by Intermediate EFL Learners in Lexical Language Related Episodes
Private speech utilization is accepted to have a critical role in the continuum of language acquisition. As a valuable device in studying learners’ talk during interaction, a language related episode (LRE) is any part of a dialogue where a student speaks about a language problem s/he comes across while completing a task. The present study investigated the role of private speech produced by Inte...
متن کاملImproved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (m...
متن کاملAn Unsupervised Model for Instance Level Subcategorization Acquisition
Most existing systems for subcategorization frame (SCF) acquisition rely on supervised parsing and infer SCF distributions at type, rather than instance level. These systems suffer from poor portability across domains and their benefit for NLP tasks that involve sentence-level processing is limited. We propose a new unsupervised, Markov Random Field-based model for SCF acquisition which is desi...
متن کاملLearning Part-of-Speech Guessing Rules from Lexicon: Extension to Non-Concatenative Operations
One of the problems in part-of-speech tagging of real-word texts is that of unknown to the lexicon words. In (Mikheev, 1996), a technique for fully unsupervised statistical acquisition of rules which guess possible parts-ofspeech for unknown words was proposed. One of the over-simplification assumed by this learning technique was the acquisition of morphological rules which obey only simple con...
متن کامل